

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

Neural Information Processing Systems

Despite the extensive application of multi-pass SGD in practice, only a few theoretical techniques have been developed to study its generalization.





Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

Neural Information Processing Systems

Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses concern single-pass SGD, a less practical variant than the commonly used multi-pass SGD. Moreover, theoretical analyses of multi-pass SGD often address a worst-case instance in a class of problems, which may be too pessimistic to explain the superior generalization ability on a particular problem instance. The goal of this paper is to provide an instance-dependent excess risk bound of multi-pass SGD for least squares in the interpolation regime, expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always generalizes worse than GD, instance-wise. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations and is therefore preferable in terms of computational time.
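The setting above can be reproduced in a toy experiment. The following is a minimal sketch (illustrative only, not the paper's analysis): a noiseless overparameterized least-squares problem where both multi-pass SGD and full-batch GD drive the training loss to (near) zero, i.e. both interpolate; the paper's contribution is comparing their excess risks at this point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # interpolation regime: d > n, noiseless labels
X = rng.normal(size=(n, d)) / np.sqrt(d)
w_star = rng.normal(size=d)
y = X @ w_star                       # an exact interpolator exists

def multipass_sgd(passes, lr=0.5):
    w = np.zeros(d)
    for _ in range(passes):          # one pass = one shuffled epoch over the data
        for i in rng.permutation(n):
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

def gd(steps, lr=2.0):               # full-batch gradient descent
    w = np.zeros(d)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w

train_loss = lambda w: np.mean((X @ w - y) ** 2)
w_sgd, w_gd = multipass_sgd(200), gd(1000)
# both reach near-zero training loss (interpolation); the paper then
# compares their excess risks, which differ by a fluctuation error term
```

The learning rates and iteration counts here are ad hoc choices for this synthetic instance, not the schedules analyzed in the paper.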






Improved Scaling Laws in Linear Regression via Data Reuse

Lin, Licong, Wu, Jingfeng, Bartlett, Peter L.

arXiv.org Machine Learning

Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling becomes unsustainable when new data runs out. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass stochastic gradient descent (multi-pass SGD) on $N$ data with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a > b > 1$), we show that multi-pass SGD achieves a test error of $Θ(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $Θ(M^{1-b} + N^{(1-b)/a})$ (see e.g., Lin et al., 2024). This suggests an improved scaling law via data reuse (i.e., choosing $L > N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.


Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization

Vansover-Hager, Shira, Koren, Tomer, Livni, Roi

arXiv.org Machine Learning

We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $Θ(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $η = Θ(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $Ω(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $Θ(1/(ηT) + η\sqrt{T})$, where $T$ is the total number of steps. These results reveal a phase transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $Ω(η\sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = \smash{\widetilde O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).
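The stated $Θ(1/(ηT) + η\sqrt{T})$ rate can be evaluated directly to see the phase transition: with $η = 1/\sqrt{n}$, the $η\sqrt{T}$ term is already of constant order during the second epoch, and it grows as training continues at that step size. A short arithmetic check (plugging numbers into the stated bound, with an arbitrary $n$):

```python
import math

n = 10_000
eta = 1 / math.sqrt(n)     # step size giving the optimal rate after one pass

def bound(T):
    """Order of the population loss from the second pass onward."""
    return 1 / (eta * T) + eta * math.sqrt(T)

for epochs in [2, 10, 100]:
    T = epochs * n
    print(f"{epochs:>3} epochs: 1/(ηT) + η√T ≈ {bound(T):.3f}")
```

At $T = 2n$ the $η\sqrt{T}$ term is already $\sqrt{2} > 1$, i.e. $Ω(1)$ overfitting after one extra pass; keeping $η = 1/\sqrt{n}$ fixed, the bound only worsens with more steps.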